When people play a repeated game they usually try to anticipate their opponents' moves based
on past observations, and then decide what action to take next. Behavioural economics studies
the mechanisms by which strategic decisions are taken in these adaptive learning processes. We
here investigate a model of learning the iterated prisoner's dilemma game. Players have the choice
between three strategies: always defect (ALLD), always cooperate (ALLC) and tit-for-tat (TFT).
The only strict Nash equilibrium in this situation is ALLD. When players learn to play this game,
convergence to the equilibrium is not guaranteed; for example, we find cooperative behaviour if
players discount observations in the distant past. When agents use small samples of observed moves
to estimate their opponent's strategy the learning process is stochastic, and sustained oscillations
between cooperation and defection can emerge. These cycles are similar to those found in stochastic
evolutionary processes, but the origin of the noise sustaining the oscillations is different and lies in
the imperfect sampling of the opponent's strategy. Based on a systematic expansion technique, we
are able to predict the properties of these learning cycles, providing an analytical tool with which
the outcome of more general stochastic adaptation processes can be characterised.

I. INTRODUCTION

The mathematical theory of games goes back to von Neumann and Morgenstern [1], and was initially
concerned with the study of equilibrium points [2, 3]. The idea that players would be able to compute
such equilibria requires severe assumptions, in particular perfect rationality and full knowledge of the game.
Additionally, each player has to assume that all other players are rational as well. Von Neumann and Morgenstern
stress the limitations of their approach explicitly: 'We repeat most emphatically that our theory is thoroughly
static. A dynamic theory would unquestionably be more complete and therefore preferable.' [1].

Since the work of von Neumann and Morgenstern more than 70 years ago, several different routes have
been taken to formulate a dynamical theory of games. Evolutionary game theory was launched by Maynard
Smith in the 1970s and considers time-dependent dynamics of populations of players [4, 5]. Each individual
in the population carries a pure strategy, inherited from its parent(s), and agents then reproduce and pass on
their strategies to their offspring, with a reproduction rate depending on the performance in the game. The
strategic content of the population evolves, with the concentration of successful strategies increasing over time,
and those of less successful strategies being reduced. Evolutionary game theory has been used to model a vast
number of phenomena in the social sciences and in economics [6-12].

These applications include in particular the study of the emergence of cooperation and altruism [13]. The
evolution of cooperative behaviour under selection pressure constitutes a formidable puzzle. The dynamics
of evolution is governed by a fierce competition between individuals, and only those who act in their own
interest and who selfishly promote their own evolutionary success at the expense of their competitors should
prevail in the long run. Nevertheless altruism and cooperative behaviour are found in a number of evolved
systems, ranging from cooperating genes or cells to cooperating animals or humans in social contexts [14-16].
The question of how cooperative behaviour has evolved under strictly competitive and selective dynamics is still
unresolved, and has recently been listed as one of the 125 big open problems in science [17].

Our goal here is to address the emergence of cooperation in a third approach to game theory. We focus on
adaptive learning processes of a small fixed set of individuals, who interact repeatedly in a game [18-23]. Players
observe their opponents' actions and aim to react dynamically by adapting their own strategic propensities,
learning from past experience. Such learning models are of particular importance for the understanding of
experiments in behavioural game theory, where human subjects play a given game repeatedly under controlled
conditions, see e.g. [24-26]. A priori it is not clear whether adaptation will converge to Nash equilibria.
Learning has for example been seen to fail to converge in games with cyclic payoff structures, and complex
trajectories including limit cycles, quasiperiodic motion and Hamiltonian chaos have instead been identified.

Mathematical models of cooperative behaviour are often based on stylised games played by a small number of
interacting individuals, each choosing from a small number of strategies. Such games have been characterised
as 'mathematical x-ray[s] of crucial features' of real-world situations [22]. The most basic setup is the celebrated
prisoner's dilemma, a game in which two players have the choice between cooperation and defection. Defection
dominates cooperation in this game: no matter what the other player decides to do, either player will always do
better defecting than cooperating. Fully rational players hence end up playing the only equilibrium strategy,
defection, and have to put up with a suboptimal payoff, when they could have scored higher had they both
cooperated.

If the prisoner's dilemma is iterated, more complex behaviour is possible and the space of all strategies
grows rapidly as the number of iterations is increased. In order to make progress it is therefore necessary
to restrict the mathematical analysis to a subset of this space. We will focus on three strategies: always
defect (ALLD), always cooperate (ALLC) and tit-for-tat (TFT). Players using the TFT strategy cooperate
in the first iteration and then proceed by playing whatever the opponent played in the previous round. The
replicator-mutator dynamics of populations of players engaging in this game have been studied in [30, 31].
ALLD has been identified as the deterministic replicator fixed point, and mutation has been seen to move the
attractor toward cooperation. Demographic noise in finite populations can alter the dynamics and can induce
coherent evolutionary cycles between defection and cooperation.

As one main result we show that the effects of memory loss in the learning dynamics are very similar to those
of mutation in evolutionary dynamics. While deterministic learning in the absence of memory loss converges
to ALLD, this Nash equilibrium is no longer an attractor when players discount observations in the distant
past, and a different fixed point, involving all three pure strategies, emerges. Deterministic replicator-type
equations are a faithful description of the learning process if and only if a large number of observations of
the opponent's actions is made before players update their own strategic preferences. If, on the contrary,
adaptation occurs more frequently and is based only on a small sample of observations, the dynamics becomes
stochastic. The source of randomness lies in the imperfect sampling of the opponent's mixed strategy profile.
When each player uses a small number of observed actions to estimate the opponent's mixed strategy, the
estimate will generally be subject to statistical errors. The observed actions were chosen according to the
opponent's mixed strategy profile, but they are still random variables. This source of noise is different from the
origin of demographic noise in the evolution of finite populations. Nevertheless the effects are similar: as our
second main result we show that sustained cycles between cooperation and defection can emerge in stochastic
learning, similar to those found in evolutionary scenarios of the iterated prisoner's dilemma game [30]. We are
able to predict the characteristic frequency and power spectra of these cycles analytically as a function of the
parameters of the game and the learning dynamics.

II. MODEL 

To define the iterated prisoner's dilemma we will follow the notation of [30]. Assuming that $m$ iterations of
the prisoner's dilemma are played in any one interaction of the two players, and that a complexity cost $c$ is
associated with playing TFT, the payoff matrix (payoff to the row player) is given by

$$
\begin{array}{c|ccc}
 & \text{ALLC} & \text{ALLD} & \text{TFT} \\ \hline
\text{ALLC} & Rm & Sm & Rm \\
\text{ALLD} & Tm & Pm & T + P(m-1) \\
\text{TFT} & Rm - c & S + P(m-1) - c & Rm - c
\end{array}
$$

i.e. a player playing ALLC will for example receive a payoff of $R$ (per round) when meeting another ALLC
player, a payoff of $S$ when playing against ALLD and a payoff of $R$ upon encountering TFT. We will denote
the payoff matrix elements as $a_{ij}$, where $i, j = 1, 2, 3$ label the strategies ALLC, ALLD and TFT respectively.
Throughout this paper we use $T = 5$, $R = 3$, $P = 1$, $S = 0.1$, $m = 10$, $c = 0.8$.
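The entries of the payoff matrix follow from playing the three strategies against each other round by round. The sketch below does exactly that, under two assumptions that are not spelled out in the text: payoffs are summed over the $m$ rounds of one interaction, and the complexity cost $c$ is charged once per interaction to a TFT player. The function names `move` and `total_payoff` are illustrative.

```python
import numpy as np

# Per-round prisoner's dilemma payoffs to the row player; values quoted in the text.
T, R, P, S = 5.0, 3.0, 1.0, 0.1
M_ROUNDS, COST = 10, 0.8  # m iterations per interaction, complexity cost c of TFT

STRATEGIES = ("ALLC", "ALLD", "TFT")


def move(strategy, opponent_last):
    """Return 'C' or 'D' for one round; opponent_last is None in the first round."""
    if strategy == "ALLC":
        return "C"
    if strategy == "ALLD":
        return "D"
    # TFT: cooperate first, then copy the opponent's previous move.
    return "C" if opponent_last is None else opponent_last


def total_payoff(row, col, m=M_ROUNDS, cost=COST):
    """Payoff to `row` against `col`, summed over m rounds (assumed convention)."""
    stage = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    last_row = last_col = None
    total = 0.0
    for _ in range(m):
        a, b = move(row, last_col), move(col, last_row)
        total += stage[(a, b)]
        last_row, last_col = a, b
    # Assumption: the complexity cost is charged once per interaction.
    return total - (cost if row == "TFT" else 0.0)


payoff = np.array([[total_payoff(r, c) for c in STRATEGIES] for r in STRATEGIES])
print(payoff)
```

With the parameter values quoted above, ALLD earns more than TFT against an ALLD opponent, while TFT does better than ALLD against a TFT opponent.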

In our model the game is played repeatedly by two players, Alice and Bob. We will assume that Alice carries
a (time-dependent) mixed strategy profile $\mathbf{x}(t) = (x_1(t), x_2(t), x_3(t))$, and similarly Bob's mixed strategy profile
at $t$ is $\mathbf{y}(t) = (y_1(t), y_2(t), y_3(t))$. We will write $i(t)$ for Alice's action at time $t$, and $j(t)$ for Bob's action,
i.e. $i(t), j(t) \in \{\text{ALLC}, \text{ALLD}, \text{TFT}\}$. Following [21-23] each player keeps attractions for each of the pure
strategies. Alice's attractions at time $t$ are labelled by $A_i(t)$ and Bob's attractions by $B_j(t)$. We will again
follow [21-23] as well as [27-29] and assume that attractions determine choice probabilities through a logit
rule, i.e. that the probabilities for Alice and Bob to play the different pure strategies at time $t$ are given by

$$
x_i(t) = \frac{e^{\beta A_i(t)}}{\sum_k e^{\beta A_k(t)}}, \qquad
y_j(t) = \frac{e^{\beta B_j(t)}}{\sum_k e^{\beta B_k(t)}}.
\tag{1}
$$

The variable $\beta$ is a model parameter, and describes the intensity of selection or response sensitivity [23]. For
$\beta \to \infty$ the players strictly choose the pure action with highest attraction, for $\beta \to 0$ they play at random.
We will here restrict the discussion to models in which both players use the same intensity of selection;
generalisation to heterogeneous intensities is straightforward.
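The two limits of the logit rule are easy to see numerically. The following is a minimal sketch; the function name `logit_probabilities` and the attraction values are illustrative and not taken from the paper.

```python
import numpy as np


def logit_probabilities(attractions, beta):
    """Map a vector of attractions to choice probabilities via the logit rule."""
    z = beta * np.asarray(attractions, dtype=float)
    z -= z.max()  # a common shift leaves the ratios unchanged and avoids overflow
    w = np.exp(z)
    return w / w.sum()


attractions = [30.0, 10.0, 29.2]  # hypothetical attraction values
for beta in (0.0, 0.1, 10.0):
    print(beta, logit_probabilities(attractions, beta))
```

At `beta = 0` all three strategies are equally likely. As `beta` grows the probability mass concentrates on the strategy with the largest attraction.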

A simple reinforcement learning dynamics is then defined by the following update rules for the attractions:

$$
A_k(t+1) = (1-\lambda) A_k(t) + a_{k,j(t)}, \qquad
B_k(t+1) = (1-\lambda) B_k(t) + a_{k,i(t)}.
\tag{2}
$$

Alice's attraction $A_k$ is therefore reinforced by the payoff $a_{k,j(t)}$ she would have received at time $t$ had she
played action $k$, and similarly for Bob. The parameter $\lambda$ indicates memory loss: observations in the distant past
carry a lesser weight than more recent rounds. For $\lambda = 0$ the players have perfect memory of past play, and use
the outcome of all past rounds with equal weight to determine their attractions. In particular $A_k$, for example, is
then the total payoff Alice would have received had she always played action $k \in \{\text{ALLC}, \text{ALLD}, \text{TFT}\}$, given
Bob's moves. For $\lambda > 0$ experiences in the past are discounted exponentially. This may happen voluntarily
as part of a learning mechanism or simply be due to fading memories and limited mental capacities. We will
occasionally refer to $\lambda$ as a memory-loss rate or discounting factor. We assume that both players learn at
identical memory-loss rates; generalisation to heterogeneous learning rules ($\lambda_{\rm Alice} \neq \lambda_{\rm Bob}$) is straightforward.
Up to relabelling this learning rule is a special case of experience-weighted attraction learning, as discussed in
[22, 23]. More general learning dynamics are discussed in the appendix.
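The exponential discounting can be made explicit by iterating the update rule for Alice's attractions. As a short worked step, assuming the attractions start from some initial value $A_k(0)$ (an initial condition the text does not specify), repeated substitution gives

$$
A_k(t) = (1-\lambda)^{t} A_k(0) + \sum_{s=0}^{t-1} (1-\lambda)^{t-1-s}\, a_{k,j(s)} .
$$

For $\lambda = 0$ every past round enters with unit weight, so $A_k(t) - A_k(0)$ is the total payoff action $k$ would have earned against Bob's observed moves. For $0 < \lambda < 1$ the round played $n$ steps before the most recent one carries weight $(1-\lambda)^n$. The weights sum to at most $1/\lambda$, so the effective memory is roughly $1/\lambda$ rounds. For $\lambda = 1$ only the most recent round contributes.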



The process defined by Eqs. (1, 2) is intrinsically stochastic: the actions $i(t)$ and $j(t)$ are drawn from the
mixed strategy profiles $\mathbf{x}(t)$ and $\mathbf{y}(t)$ respectively, and accordingly the attractions $A_k(t)$ and $B_k(t)$ are random
variables as well. Simple averaging, taking into account that $i(t)$ takes the value $i(t) = \ell$ with probability $x_\ell(t)$
and that $j(t) = \ell$ with probability $y_\ell(t)$, results in the following average attraction update:

$$
\begin{aligned}
A_k(t+1) &= (1-\lambda) A_k(t) + \sum_{\ell=1}^{3} a_{k\ell}\, y_\ell(t), \\
B_k(t+1) &= (1-\lambda) B_k(t) + \sum_{\ell=1}^{3} a_{k\ell}\, x_\ell(t).
\end{aligned}
\tag{3}
$$

Limiting dynamics of this type can provide insight into the expected outcome of learning. Deterministic
learning has been shown to lead to modified replicator equations in a continuous-time limit. Analyses
of discrete-time deterministic learning can be found in [32]. The derivation of the deterministic dynamics relies
on an adiabatic approximation though: it is assumed that strategy updates occur on a much slower time scale
than the actual play. In order to perform the update of Eq. (3) Alice has to have full knowledge of Bob's mixed
strategy $\mathbf{y}(t)$, and Bob needs to be aware of Alice's strategy $\mathbf{x}(t)$. This will generally be very hard to achieve
for the players. Eqs. (3) are therefore only an approximate description of the learning process, and can at best
be expected to describe the average behaviour. Describing learning in terms of these deterministic equations
is procedurally akin to describing the average behaviour of evolving populations by means of deterministic
replicator equations. To understand the nature of the approximation underlying the deterministic limit it is
instructive to interpolate between the deterministic average process and the actual stochastic dynamics. We
here consider a batch learning process, in which each player samples $N$ actions of their respective opponent,
and then updates their attractions. The above 'adiabatic' approximation consists in assuming stationarity of
the mixed strategy profiles between attraction updates. Specifically we introduce the following process:

$$
\begin{aligned}
A_k(\tau+1) &= (1-\lambda) A_k(\tau) + \frac{1}{N} \sum_{\alpha=1}^{N} a_{k,j_\alpha(\tau)}, \\
B_k(\tau+1) &= (1-\lambda) B_k(\tau) + \frac{1}{N} \sum_{\alpha=1}^{N} a_{k,i_\alpha(\tau)}.
\end{aligned}
\tag{4}
$$

The interpretation of these update rules is as follows: at time $\tau$ Alice independently selects $N$ actions $i_\alpha(\tau)$
($\alpha = 1, \dots, N$) following her mixed strategy profile $\mathbf{x}(\tau)$ at that time, i.e. the $\{i_\alpha(\tau)\}$ are independent
random variables, and for each $\alpha$ one has $i_\alpha(\tau) = \ell$ with probability $x_\ell(\tau)$. Bob draws his actions $j_\alpha(\tau)$ in
a similar manner, using his mixed strategy $\mathbf{y}(\tau)$. These actions represent the moves made by the two players
in $N$ successive rounds of the game; the mixed strategies $\mathbf{x}(\tau)$ and $\mathbf{y}(\tau)$ are kept fixed during the course
of these rounds. At the end of the batch of $N$ rounds both Alice and Bob update their attractions based
on Eq. (4), and then adapt their mixed strategy profiles using Eq. (1) (with $t$ replaced by $\tau$). We have
intentionally used the notation $\tau$ rather than $t$ to denote time steps of this batch dynamics. One unit of time
$\tau$ corresponds to $N$ repetitions of the game, i.e. to $N$ units of time $t$. We will refer to $N$, the number of
observations made in between updates of the attractions, as the batch size, following the language of machine
learning [33]. Small batch sizes $N$ correspond to fast adaptation. If $N = 1$ we recover the original dynamics
(2), where strategy updates are performed after every single round of the game. Large $N$ on the other hand
indicates infrequent adaptation; the limit of infinite batches leads to the deterministic update rule Eq. (3).
This limit is based on the assumption that the mixed strategy profiles $\mathbf{x}(\tau)$ and $\mathbf{y}(\tau)$ are stationary during
each batch of $N$ repetitions of the game. This assumption will be irrelevant at small batch sizes $N$, but
more severe in the limit of large $N$. Taking the limit $N \to \infty$ to derive the deterministic learning rule is
analogous to the procedure leading to a description of evolving populations in terms of deterministic replicator
equations. In evolutionary systems these descriptions are accurate for populations with an infinite number of
individuals. Stochastic corrections cannot be neglected in finite populations, and the resulting noise has been
seen to alter the dynamics substantially, see e.g. [30, 31]. Similarly, real-world players do not operate adiabatic
learning dynamics; instead small batch sizes $N$ are probably more appropriate to describe experiments in
behavioural economics. It is therefore important to go beyond the deterministic limit of Eq. (3) and to study
stochastic effects at finite batch sizes. First steps have been taken in [33], and it is one of the main purposes
of this work to apply these ideas to the iterated prisoner's dilemma game.
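The batch process described above can be simulated in a few lines. The sketch below is one possible implementation, not the code used for the paper's figures. It assumes the total-payoff matrix obtained from $T = 5$, $R = 3$, $P = 1$, $S = 0.1$, $m = 10$, $c = 0.8$ with payoffs summed over the $m$ rounds, and it starts both players from zero attractions, a choice the text does not specify.

```python
import numpy as np

# Assumed total payoffs to the row player, strategies ordered (ALLC, ALLD, TFT).
PAYOFF = np.array([[30.0, 1.0, 30.0],
                   [50.0, 10.0, 14.0],
                   [29.2, 8.3, 29.2]])


def logit(attractions, beta):
    z = beta * attractions
    w = np.exp(z - z.max())
    return w / w.sum()


def batch_learning(n_batch, n_steps, beta=0.1, lam=0.01, seed=0):
    """Simulate batch learning; returns both mixed strategies over batch time."""
    rng = np.random.default_rng(seed)
    A = np.zeros(3)  # Alice's attractions (initial condition is an assumption)
    B = np.zeros(3)  # Bob's attractions
    xs = np.empty((n_steps, 3))
    ys = np.empty((n_steps, 3))
    for tau in range(n_steps):
        x, y = logit(A, beta), logit(B, beta)
        xs[tau], ys[tau] = x, y
        # Both players draw N actions from strategies held fixed during the batch.
        i_alpha = rng.choice(3, size=n_batch, p=x)  # Alice's moves
        j_alpha = rng.choice(3, size=n_batch, p=y)  # Bob's moves
        # Each attraction is reinforced by the batch-averaged payoff of action k.
        A = (1.0 - lam) * A + PAYOFF[:, j_alpha].mean(axis=1)
        B = (1.0 - lam) * B + PAYOFF[:, i_alpha].mean(axis=1)
    return xs, ys


xs, ys = batch_learning(n_batch=10, n_steps=2000)
print("time-averaged strategy of Alice (ALLC, ALLD, TFT):", xs.mean(axis=0))
```

Setting `n_batch=1` corresponds to updating after every single interaction. Increasing `n_batch` reduces the sampling noise in the batch average, so trajectories approach the deterministic update.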



III. RESULTS AND DISCUSSION

We illustrate the outcome of the continuous-time deterministic learning (see appendix) in Fig. 1. At low
memory-loss rates the dynamics is essentially governed by the standard replicator equations, and the system
has a single stable fixed point near ALLD, similar to what is reported for low mutation rates in evolutionary
systems [30]. As the memory-loss rate is increased ALLD remains a stable attractor, but cyclic attractors
around an unstable fixed point emerge (top right panel of Fig. 1). At even higher memory loss this second
fixed point becomes a stable spiral. Provided players do not discount past play too strongly this spiral fixed
point is located in the vicinity of the ALLC/TFT edge of the strategy simplex, and we conclude that moderate
memory loss may enhance cooperative behaviour. When the memory becomes even shorter the fixed point
moves towards the centre of the simplex. In the extreme case of full memory loss, $\lambda = 1$, players ignore the past
history beyond the last iteration entirely. Depending on the response sensitivity both players play essentially
at random; the three strategies are used with very similar frequencies.

It is interesting to note that the outcome of deterministic learning with memory loss resembles the behaviour
of replicator-mutator dynamics of this game [30]. Discounting past experience in learning and mutation in
evolution both promote cooperation when they are moderate in strength. The attractors of learning with
quick memory loss on the other hand are similar to those of evolutionary systems in which mutation dominates
selection.

We will now move to learning at finite batch sizes $N$. Players are then no longer able to obtain a perfect
sample of their opponents' mixed strategy profile before updating their own strategic propensities, and the
dynamics becomes stochastic. Results of numerical simulations are shown in Fig. 2. We here focus on a regime
in which deterministic learning approaches a fixed point. Stochastic learning at the same discounting rate and
intensity of selection results in sustained cycles between cooperation and defection. The amplitude of these
cycles is found to scale as $N^{-1/2}$ in the batch size, but the coefficient multiplying $N^{-1/2}$ can be substantial
(see appendix) so that the oscillations can have a significant amplitude. The inset of Fig. 2 confirms that
the average of several independent runs of the stochastic dynamics is accurately described by the deterministic
update rules of Eqs. (3).

The cycling behaviour of the stochastic learning process can be understood as the result of an amplification
mechanism, which turns intrinsic white noise into coherent oscillations [35]. The intuitive picture is as
follows: at the memory-loss rate chosen in Fig. 2 the deterministic dynamics spirals into a stable fixed point
asymptotically, i.e. the relevant eigenvalue of the dynamics is complex. If an instantaneous perturbation were
applied to the deterministic system at the fixed point, the dynamics would return to the fixed point following
a trajectory of damped oscillations. At finite batch sizes, however, the dynamics is subject to persistent random
fluctuations, constantly driving the system away from the fixed point. The combination of this permanent
'excitation' and the oscillatory relaxation results in a coherently maintained cyclic pattern.
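The amplification mechanism can be illustrated with a toy linear model that is not the learning dynamics itself. In the sketch below a two-dimensional map spirals into a fixed point at the origin; all numbers (`r`, `theta`, `noise_std`) are hypothetical. Without noise the oscillation dies out, whereas persistent white noise keeps it alive at an angular frequency close to the rotation angle of the map.

```python
import numpy as np


def simulate(noise_std, n_steps=20000, r=0.98, theta=0.2, seed=0):
    """Iterate u(t+1) = J u(t) + noise, where J has eigenvalues r*exp(+-i*theta)."""
    rng = np.random.default_rng(seed)
    J = r * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta), np.cos(theta)]])
    u = np.array([1.0, 0.0])  # start displaced from the fixed point at the origin
    out = np.empty(n_steps)
    for t in range(n_steps):
        out[t] = u[0]
        u = J @ u + noise_std * rng.standard_normal(2)
    return out


quiet = simulate(noise_std=0.0)
noisy = simulate(noise_std=0.05)

# Without noise the oscillation dies out; with noise it persists.
print("late-time amplitude, no noise:  ", np.abs(quiet[-1000:]).max())
print("late-time amplitude, with noise:", np.abs(noisy[-1000:]).max())

# Averaged periodogram over 20 segments of 1000 steps each.
segments = noisy.reshape(20, 1000)
spectrum = (np.abs(np.fft.rfft(segments, axis=1)) ** 2).mean(axis=0)
omega = 2.0 * np.pi * np.fft.rfftfreq(1000)
peak_omega = omega[np.argmax(spectrum)]
print("peak angular frequency:", peak_omega)
```

The spectrum of the noisy run is peaked rather than flat, which is the signature of coherent quasi-cycles as opposed to structureless fluctuations.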
FIG. 1: Illustration of the behaviour of the deterministic continuous-time learning (see appendix) at different
memory-loss rates $\lambda$. Intensity of selection is $\beta = 0.01$.

Similar noise-induced oscillation phenomena have been observed in various individual-based models of
population dynamics, evolutionary game theory, epidemics and biochemical reactions, see e.g. |5TII55^HD] . While
the mechanism of resonant amplification in stochastic learning is analogous to the one observed in
population-based models, the origin of the noise is different. In the individual-based models, large but finite populations
are considered. Deterministic mean-field equations can then be derived in the limit of infinite populations.
In finite populations, the dynamics remains stochastic, due to the random nature of the interactions on the
microscopic level. The resulting noise scales with the inverse square root of the system size, and has been
termed 'demographic stochasticity' [SHUT]. In the learning model the source of the noise is the inaccuracy
with which players sample their opponent's strategy profile at finite batch sizes, and the amplitude of the noise
and of the resulting quasi-cycles is proportional to the inverse square root of the batch size $N$.

We have used a systematic expansion in the inverse batch size to characterise these cycles further (see
appendix). These methods are similar to system-size expansions widely used in population-based models
[42], even though the expansion parameter in the learning dynamics is the inverse batch size, not the size of
the population. Simpler games have been studied with this technique in [34]. These expansion methods are
accurate for large, but finite batch sizes. As seen in Fig. 3 the power spectrum of the coherent oscillations can
be predicted analytically with great accuracy for moderate and high batch sizes $N$. The agreement for batch
sizes of $N = 10$ is still reasonable; systematic deviations are only found if the number of observations between
strategy updates is reduced further.

To characterise the outcome of the stochastic learning process in more detail we show the resulting
stationary distributions in strategy space in Fig. 4. The panels in the upper row correspond to a memory-loss
parameter for which the deterministic dynamics has a cyclic attractor. At small batch sizes stochastic learning
essentially covers the entire strategy simplex, with the exception of the region near the ALLD/ALLC edge.
Surprisingly, the most frequently visited points in strategy space are found along the ALLD/TFT edge; the
Nash strategy ALLD is played only very rarely. At larger batch sizes the dynamics concentrates in a region
about the deterministic cycle. At fast memory loss (lower row of Fig. 4) deterministic learning has a fixed
point. Again, the stochastic dynamics reaches almost the full strategy space for small batch sizes, but more
and more concentration on the deterministic attractor is found as the frequency of adaptation is lowered (i.e.
when the batch size is increased). In all cases shown in Fig. 4 the time average of learning is found near TFT;
defection occurs only rarely.

FIG. 2: (Color on-line) Sustained oscillations in the stochastic dynamics. Frequency with which TFT is played by
Alice as a function of time at $N = 10$ observations between adaptation events. The horizontal line is the fixed point of
deterministic learning. The inset shows the frequencies of ALLC, ALLD and TFT in the initial phase of the dynamics.
Solid lines are the outcome of deterministic learning, symbols show data from an average over 100 independent runs of
stochastic learning at $N = 10$. Model parameters are $\beta = 0.1$, $\lambda = 0.01$.

FIG. 3: (Color on-line) Power spectrum of the frequency with which TFT is played. Horizontal axis shows the angular
frequency $\omega$, vertical axis the spectrum of fluctuations about the deterministic fixed point. Results from numerical
simulations of the stochastic dynamics are shown (markers) along with the curve predicted by the theory in the limit
of large, but finite batch size. Power spectra have been re-scaled by the inverse batch size, see appendix. Model
parameters are $\beta = 0.1$, $\lambda = 0.01$. Simulations are averaged over 1000 runs.

FIG. 4: (Color on-line) Frequencies of visits of the stochastic learning dynamics. Crosses mark the time average of the
stochastic dynamics, black lines the trajectory or attractor of the deterministic discrete-time map. Data is obtained
from 100 runs of the stochastic process at an intensity of selection $\beta = 0.1$. Colours indicate the frequency with
which different regions are visited: a binning of strategy space is performed, and e.g. yellow stands for the 10% most
visited bins, orange for the next 10% and so on (see legend). Grey areas are not visited by the dynamics at all in our
simulations.



To summarise, we have here analysed in detail the learning dynamics of two fixed players interacting in a
repeated prisoner's dilemma game. We find that discounting past experience in deterministic learning produces
behaviour very similar to the dynamics found in evolutionary replicator-mutator systems [30]. Memory loss
removes the stability of the ALLD fixed point, and leads to attractors near the ALLC/TFT edge of strategy
space. In order to go beyond the adiabatic assumption underlying purely deterministic adaptation models, we
have also addressed more realistic stochastic learning. Here, players update their strategic propensities more
frequently, relying on an imperfect sampling of their opponent's strategy. We then observe persistent stochastic
cycles, with a time average concentrated near TFT, paralleling earlier observations in finite evolutionary
dynamics [30]. Based on a systematic expansion technique we have characterised these cycles analytically.
This method is applicable very generally, and can be used to study the effects of stochasticity in other learning
models [HI |22] , in machine learning problems and in algorithmic game theory [43]. Cyclic behaviour has been
reported in experimental studies of multi-player learning [24]. We expect that the techniques we have
introduced will be helpful in formulating and calibrating theoretical learning models describing these real-world
laboratory experiments.
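Appendix

1. Deterministic dynamics and modified replicator equations

The limiting deterministic dynamics, obtained for $N \to \infty$, is given by Eqs. (3). Taking into account Eqs.
(1) one can then write the update rule solely in terms of $\mathbf{x}$ and $\mathbf{y}$ and finds the following map [28]:

$$
\begin{aligned}
x_i(t+1) &= \frac{x_i(t)^{1-\lambda}\, e^{\beta \sum_j a_{ij} y_j(t)}}{\sum_k x_k(t)^{1-\lambda}\, e^{\beta \sum_j a_{kj} y_j(t)}}, \\
y_j(t+1) &= \frac{y_j(t)^{1-\lambda}\, e^{\beta \sum_i a_{ji} x_i(t)}}{\sum_k y_k(t)^{1-\lambda}\, e^{\beta \sum_i a_{ki} x_i(t)}}.
\end{aligned}
\tag{A5}
$$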
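That the map can be written purely in terms of the mixed strategies can be checked numerically. The sketch below compares two routes: updating the attractions with the averaged rule and then applying the logit rule, versus applying a map of the form $x_i \propto x_i^{1-\lambda} e^{\beta \sum_j a_{ij} y_j}$ directly. The payoff matrix is the assumed total-payoff matrix for the parameters quoted in the main text, and the attraction values are arbitrary illustrative numbers.

```python
import numpy as np

# Assumed total payoffs to the row player, strategies ordered (ALLC, ALLD, TFT).
PAYOFF = np.array([[30.0, 1.0, 30.0], [50.0, 10.0, 14.0], [29.2, 8.3, 29.2]])


def logit(attractions, beta):
    z = beta * attractions
    w = np.exp(z - z.max())
    return w / w.sum()


def map_step(x, y, beta, lam, a=PAYOFF):
    """One step of the deterministic map written purely in terms of x and y."""
    log_x = (1.0 - lam) * np.log(x) + beta * (a @ y)
    log_y = (1.0 - lam) * np.log(y) + beta * (a @ x)
    new_x = np.exp(log_x - log_x.max())
    new_y = np.exp(log_y - log_y.max())
    return new_x / new_x.sum(), new_y / new_y.sum()


beta, lam = 0.1, 0.01
A = np.array([0.3, -1.2, 2.0])   # hypothetical attractions of Alice
B = np.array([1.0, 0.5, -0.7])   # hypothetical attractions of Bob
x, y = logit(A, beta), logit(B, beta)

# Route 1: update the attractions with the averaged rule, then apply the logit rule.
x_via_A = logit((1.0 - lam) * A + PAYOFF @ y, beta)
y_via_B = logit((1.0 - lam) * B + PAYOFF @ x, beta)

# Route 2: apply the map directly to the mixed strategies.
x_map, y_map = map_step(x, y, beta, lam)

print(np.allclose(x_via_A, x_map), np.allclose(y_via_B, y_map))
```

The two routes agree because the normalisation of the logit rule cancels between numerator and denominator, so the attractions never need to be tracked explicitly.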



Taking a continuous-time limit of (3), as discussed in [29], one finds

$$
\begin{aligned}
\dot A_k &= -\lambda A_k + \sum_j a_{kj}\, y_j, \\
\dot B_k &= -\lambda B_k + \sum_i a_{ki}\, x_i.
\end{aligned}
\tag{A6}
$$

Using Eqs. (1) it is then straightforward to derive deterministic continuous-time evolution equations for the
frequencies $\{x_i(t), y_j(t)\}$ with which the pure strategies $i = 1, \dots, 3$ are played by the respective players. One
finds [51]

$$
\begin{aligned}
\dot x_i &= x_i\, \beta \Big( \sum_k a_{ik}\, y_k - \sum_{k\ell} x_k a_{k\ell}\, y_\ell \Big) - \lambda x_i \Big( \log x_i - \sum_k x_k \log x_k \Big), \\
\dot y_j &= y_j\, \beta \Big( \sum_k a_{jk}\, x_k - \sum_{k\ell} y_k a_{k\ell}\, x_\ell \Big) - \lambda y_j \Big( \log y_j - \sum_k y_k \log y_k \Big).
\end{aligned}
\tag{A7}
$$

These equations are occasionally referred to as the Sato-Crutchfield equations, and it is worth pointing out
that they reduce to the standard replicator equations for the case of learning without memory loss ($\lambda = 0$).
Furthermore their behaviour is solely determined by the ratio $\lambda/\beta$. If this ratio is fixed then the role of the
remaining parameter is merely to set the time scale. It is also easy to verify that the fixed points of Eqs. (A7)
coincide with those of Eqs. (A5). The behaviour of these dynamics can be quite intricate, depending on the
structure of the underlying game. Sato et al. have for example identified chaotic motion in modified versions
of the celebrated rock-paper-scissors game.

Fig. 1 in the main text has been obtained from a numerical integration of Eqs. (A7), using a forward Euler
scheme. We point out that it is hard to accurately determine the shape of cyclic attractors such as the one in
the top-right panel of Fig. 1, even when integrating the dynamics up to large times of up to 5 • 10® and/or at 
small time stepping {dt « 10"'^). The cycle in Fig. 1 should therefore be understood as an illustration, rather 
than as a quantitative characterisation of the attractor. 
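A forward Euler integration of the continuous-time equations can be sketched as follows. This is an illustration under stated assumptions and not the code behind Fig. 1: the payoff matrix is the assumed total-payoff matrix for the parameters of the main text, and the step size, run length and initial conditions are arbitrary choices.

```python
import numpy as np

# Assumed total payoffs to the row player, strategies ordered (ALLC, ALLD, TFT).
PAYOFF = np.array([[30.0, 1.0, 30.0], [50.0, 10.0, 14.0], [29.2, 8.3, 29.2]])


def velocity(x, y, beta, lam, a=PAYOFF):
    """Right-hand side of the continuous-time equations for (x, y)."""
    fx, fy = a @ y, a @ x            # expected payoffs of the pure strategies
    ex, ey = np.log(x), np.log(y)
    dx = x * (beta * (fx - x @ fx) - lam * (ex - x @ ex))
    dy = y * (beta * (fy - y @ fy) - lam * (ey - y @ ey))
    return dx, dy


def integrate(x0, y0, beta, lam, dt=1e-3, n_steps=20000):
    """Forward Euler integration; dt and n_steps are illustrative choices."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(n_steps):
        dx, dy = velocity(x, y, beta, lam)
        x, y = x + dt * dx, y + dt * dy
    return x, y


x, y = integrate([0.3, 0.3, 0.4], [0.2, 0.5, 0.3], beta=0.01, lam=0.001)
print("x =", x, "sum =", x.sum())
```

The components of each velocity sum to zero, so the Euler step conserves normalisation up to rounding. Positivity is not guaranteed for large steps, which is one reason small time steps are needed near the edges of the simplex.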



2. Analytical characterisation of stochastic cycles 



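It is possible to make analytical progress and to compute the spectrum of the oscillations between cooperation
and defection in the limit of large, but finite batch sizes $N$.
We start from the dynamics of Eq. (4), writing $t$ for the batch time,

$$
\begin{aligned}
A_k(t+1) &= (1-\lambda) A_k(t) + \frac{1}{N} \sum_{\alpha=1}^{N} a_{k,j_\alpha(t)}, \\
B_k(t+1) &= (1-\lambda) B_k(t) + \frac{1}{N} \sum_{\alpha=1}^{N} a_{k,i_\alpha(t)},
\end{aligned}
\tag{A8}
$$

and note that the expression $\frac{1}{N}\sum_{\alpha=1}^{N} a_{k,j_\alpha(t)}$ on the right-hand side is a random variable at finite batch sizes
$N$. The same is true for the analogous expression in the update rule for $B_k$. The mean value of $\frac{1}{N}\sum_{\alpha=1}^{N} a_{k,j_\alpha(t)}$
is given by $\mu_k(t) = \sum_j a_{kj}\, y_j(t)$, given that the $j_\alpha(t)$ are drawn from the mixed strategy profile $\mathbf{y}(t)$, i.e. action
$\ell \in \{\text{ALLC}, \text{ALLD}, \text{TFT}\}$ occurs with frequency $y_\ell(t)$ on average. Similarly $\frac{1}{N}\sum_{\alpha=1}^{N} a_{k,i_\alpha(t)}$ has an average
of $\nu_k(t) = \sum_i a_{ki}\, x_i(t)$. Separating off fluctuations, and anticipating their scaling with $N$, we write

$$
\begin{aligned}
\frac{1}{N} \sum_{\alpha=1}^{N} a_{k,j_\alpha(t)} &= \sum_j a_{kj}\, y_j(t) + \frac{1}{\sqrt{N}}\, \xi_k(t), \\
\frac{1}{N} \sum_{\alpha=1}^{N} a_{k,i_\alpha(t)} &= \sum_i a_{ki}\, x_i(t) + \frac{1}{\sqrt{N}}\, \eta_k(t).
\end{aligned}
\tag{A9}
$$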

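By means of the central limit theorem $\xi_k(t)$ and $\eta_k(t)$ can, in the limit of large but finite $N$, be approximated
as Gaussian noise variables of mean zero and with the following correlations:

$$
\begin{aligned}
\langle \xi_k(t)\, \xi_\ell(t') \rangle &= \delta_{tt'} \Big[ \sum_j a_{kj} a_{\ell j}\, y_j(t) - \mu_k(t)\, \mu_\ell(t) \Big], \\
\langle \eta_k(t)\, \eta_\ell(t') \rangle &= \delta_{tt'} \Big[ \sum_i a_{ki} a_{\ell i}\, x_i(t) - \nu_k(t)\, \nu_\ell(t) \Big], \\
\langle \xi_k(t)\, \eta_\ell(t') \rangle &= 0.
\end{aligned}
\tag{A10}
$$

Here $\delta_{tt'} = 1$ for $t = t'$ and $\delta_{tt'} = 0$ otherwise. These expressions are obtained for example by writing

$$
\xi_k(t) = \sqrt{N} \Big[ \frac{1}{N} \sum_{\alpha=1}^{N} a_{k,j_\alpha(t)} - \mu_k(t) \Big],
\tag{A11}
$$

followed by a straightforward evaluation of the above correlators to the appropriate order in $N$ and taking
into account the statistics of the $j_\alpha(t)$.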
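The covariance of the sampling noise can be checked by direct Monte Carlo sampling. The sketch below assumes the covariance takes the form $\sum_j a_{kj} a_{\ell j} y_j - \mu_k \mu_\ell$ with $\mu_k = \sum_j a_{kj} y_j$, which is what one obtains for $N$ independent draws from $\mathbf{y}$. The payoff matrix is the assumed total-payoff matrix for the parameters of the main text, and the opponent strategy `y` is a hypothetical example.

```python
import numpy as np

# Assumed total payoffs to the row player, strategies ordered (ALLC, ALLD, TFT).
PAYOFF = np.array([[30.0, 1.0, 30.0], [50.0, 10.0, 14.0], [29.2, 8.3, 29.2]])


def noise_covariance(y, a=PAYOFF):
    """Equal-time covariance of the rescaled sampling noise for opponent strategy y."""
    mu = a @ y
    return (a * y) @ a.T - np.outer(mu, mu)


def sampled_covariance(y, n_batch, n_samples, seed=0, a=PAYOFF):
    """Monte Carlo estimate from repeated batches of n_batch sampled actions."""
    rng = np.random.default_rng(seed)
    draws = rng.choice(3, size=(n_samples, n_batch), p=y)   # sampled opponent actions
    batch_mean = a[:, draws].mean(axis=2)                   # shape (3, n_samples)
    xi = np.sqrt(n_batch) * (batch_mean - (a @ y)[:, None])
    return xi @ xi.T / n_samples


y = np.array([0.2, 0.3, 0.5])  # hypothetical mixed strategy of the opponent
print(noise_covariance(y))
print(sampled_covariance(y, n_batch=10, n_samples=20000))
```

The rescaling by $\sqrt{N}$ makes the covariance independent of the batch size, which is why the fluctuations of the batch average itself shrink as $N^{-1/2}$.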



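We can now proceed to insert these expressions into the map (A5) and find

$$
\begin{aligned}
x_i(t+1) &= \frac{x_i(t)^{1-\lambda}\, e^{\beta \left[ \sum_j a_{ij} y_j(t) + N^{-1/2} \xi_i(t) \right]}}{\sum_k x_k(t)^{1-\lambda}\, e^{\beta \left[ \sum_j a_{kj} y_j(t) + N^{-1/2} \xi_k(t) \right]}}, \\
y_j(t+1) &= \frac{y_j(t)^{1-\lambda}\, e^{\beta \left[ \sum_i a_{ji} x_i(t) + N^{-1/2} \eta_j(t) \right]}}{\sum_k y_k(t)^{1-\lambda}\, e^{\beta \left[ \sum_i a_{ki} x_i(t) + N^{-1/2} \eta_k(t) \right]}}.
\end{aligned}
\tag{A12}
$$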

Given the presence of the noise terms $\xi_k(t)$ and $\eta_k(t)$, the mixed strategy profiles $\{x_i(t), y_j(t)\}$ will be stochastic
variables themselves. The next step is to self-consistently separate deterministic from stochastic contributions,
and to derive a closed set of equations describing the evolution of fluctuations about the deterministic limit.
To this end we write

$$
x_i(t) = \bar x_i(t) + \frac{1}{\sqrt{N}}\, \tilde x_i(t), \qquad
y_j(t) = \bar y_j(t) + \frac{1}{\sqrt{N}}\, \tilde y_j(t),
\tag{A13}
$$

where the quantities with overlines represent the deterministic contributions, and quantities with tildes are
stochastic fluctuations. Eqs. (A12) can be written in the form

$$
\begin{aligned}
x_i(t+1) &= f_i(\mathbf{x}(t), \mathbf{y}(t), \boldsymbol{\xi}(t)), \\
y_j(t+1) &= g_j(\mathbf{x}(t), \mathbf{y}(t), \boldsymbol{\eta}(t)),
\end{aligned}
\tag{A14}
$$

with suitable functions $\{f_i, g_j\}$. One proceeds by substituting (A13) on both sides of Eq. (A14), followed by
a systematic expansion in powers of $N^{-1/2}$. To lowest order one finds

$$
\begin{aligned}
\bar x_i(t+1) &= f_i(\bar{\mathbf{x}}(t), \bar{\mathbf{y}}(t), 0), \\
\bar y_j(t+1) &= g_j(\bar{\mathbf{x}}(t), \bar{\mathbf{y}}(t), 0),
\end{aligned}
\tag{A15}
$$

i.e. one recovers the deterministic map (A5).

While the calculation up to now applies to any deterministic trajectory, we will from now on restrict the
discussion to an asymptotic regime, and assume that the deterministic dynamics has reached a fixed point
$\mathbf{z}^* = (\mathbf{x}^*, \mathbf{y}^*)$. This is appropriate in the context of the present investigation, as we are interested in stochastic
quasi-cycles about deterministic fixed points. Based on the restriction to deterministic fixed points further
analytical progress is relatively straightforward¹.

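In next-to-leading order of the expansion in powers of $N^{-1/2}$ one has

$$
\begin{aligned}
\tilde x_i(t+1) &= \sum_k \left[ \left. \frac{\partial f_i(\mathbf{x}, \mathbf{y}, \boldsymbol{\xi})}{\partial x_k} \right|_{(\mathbf{x}^*, \mathbf{y}^*, 0)} \tilde x_k(t) + \left. \frac{\partial f_i(\mathbf{x}, \mathbf{y}, \boldsymbol{\xi})}{\partial y_k} \right|_{(\mathbf{x}^*, \mathbf{y}^*, 0)} \tilde y_k(t) \right] + \kappa_i(t), \\
\tilde y_j(t+1) &= \sum_k \left[ \left. \frac{\partial g_j(\mathbf{x}, \mathbf{y}, \boldsymbol{\eta})}{\partial x_k} \right|_{(\mathbf{x}^*, \mathbf{y}^*, 0)} \tilde x_k(t) + \left. \frac{\partial g_j(\mathbf{x}, \mathbf{y}, \boldsymbol{\eta})}{\partial y_k} \right|_{(\mathbf{x}^*, \mathbf{y}^*, 0)} \tilde y_k(t) \right] + \rho_j(t),
\end{aligned}
\tag{A16}
$$

where

$$
\begin{aligned}
\kappa_i(t) &= \beta \Big( x_i^*\, \xi_i(t) - x_i^* \sum_k x_k^*\, \xi_k(t) \Big), \\
\rho_j(t) &= \beta \Big( y_j^*\, \eta_j(t) - y_j^* \sum_k y_k^*\, \eta_k(t) \Big).
\end{aligned}
\tag{A17}
$$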

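Writing $\mathbf{z} = (z_1, \dots, z_6) = (x_1, x_2, x_3, y_1, y_2, y_3)$, and using the notation $\mathbf{z}(t) = \mathbf{z}^* + N^{-1/2}\boldsymbol{\zeta}(t)$ to separate
deterministic from stochastic contributions ($\boldsymbol{\zeta} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{y}_1, \tilde{y}_2, \tilde{y}_3)$), one has

$$
\boldsymbol{\zeta}(t+1) = J^*\,\boldsymbol{\zeta}(t) + \boldsymbol{\varphi}(t),
\tag{A18}
$$

where $J^*$ is the $6\times 6$ Jacobian matrix of the deterministic equations (A5), evaluated at the fixed point
$\mathbf{z}^* = (\mathbf{x}^*, \mathbf{y}^*)$. The variable $\boldsymbol{\varphi} = (\varphi_1, \dots, \varphi_6) = (\kappa_1, \kappa_2, \kappa_3, \rho_1, \rho_2, \rho_3)$ represents Gaussian noise, uncorrelated
in time, but with cross-correlations between the different components:

$$
\langle \varphi_a(t)\,\varphi_b(t') \rangle = \delta_{tt'}\,D^*_{ab}.
\tag{A19}
$$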



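The elements of $D^*$ can be expressed in terms of the deterministic variables $\mathbf{z}^*$. More precisely one has, using
Eqs. (A17),

$$
D^*_{ij} = \beta^2 \left[ x_i^* x_j^* \langle \xi_i \xi_j \rangle - x_i^* x_j^* \sum_{k=1}^{3} x_k^* \langle \xi_i \xi_k \rangle - x_i^* x_j^* \sum_{k=1}^{3} x_k^* \langle \xi_j \xi_k \rangle + x_i^* x_j^* \sum_{k=1}^{3} \sum_{\ell=1}^{3} x_k^* x_\ell^* \langle \xi_k \xi_\ell \rangle \right]
\tag{A20}
$$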



Footnote: A full analytical characterisation of stochastic effects is also possible for periodic attractors of the deterministic dynamics; this
has been discussed in the context of chemical reaction systems in [44] and [45]. Such approaches are based on Floquet theory,
and we expect that they are applicable also in the learning scenario (with suitable modifications to accommodate the
discrete-time dynamics). This is beyond the scope of the work presented in this paper. We point out however that all equations
up to (A22) are valid for any deterministic trajectory, provided the fixed point values $\mathbf{z}^*$ in the relevant expressions are replaced
by their time-dependent counterparts.

and

$$
D^*_{i+3,\,j+3} = \beta^2 \left[ y_i^* y_j^* \langle \eta_i \eta_j \rangle - y_i^* y_j^* \sum_{k=1}^{3} y_k^* \langle \eta_i \eta_k \rangle - y_i^* y_j^* \sum_{k=1}^{3} y_k^* \langle \eta_j \eta_k \rangle + y_i^* y_j^* \sum_{k=1}^{3} \sum_{\ell=1}^{3} y_k^* y_\ell^* \langle \eta_k \eta_\ell \rangle \right]
\tag{A21}
$$



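for $i, j \in \{1,2,3\}$. The noise variables $\varphi_a$ with $a \in \{1,2,3\}$ are uncorrelated with those with $a \in \{4,5,6\}$,
so that the matrix $D^*$ is block diagonal ($D^*_{ab}$ and $D^*_{ba}$ both vanish if $a \in \{1,2,3\}$ and $b \in \{4,5,6\}$). The
covariances of the noise variables $\{\xi_k\}$ and $\{\eta_k\}$ are given by (A10).

One further potentially subtle point deserves some attention here. The covariance elements of the noise variables $\{\xi_k\}$ and $\{\eta_k\}$ as given in
(A10) depend on the variables $\mathbf{z}(t) = (x_1(t), x_2(t), x_3(t), y_1(t), y_2(t), y_3(t))$. These in turn have deterministic
and stochastic contributions, $\mathbf{z} = \mathbf{z}^* + N^{-1/2}\boldsymbol{\zeta}$. Within our expansion in powers of $N^{-1/2}$ it is justified
to self-consistently suppress the stochastic contributions $N^{-1/2}\boldsymbol{\zeta}$ to the variables $\mathbf{z}$ in Eq. (A10), as these
contributions would not affect results to the order of $N^{-1/2}$ we are working at. For the purposes of Eqs. (A20)
and (A21) we therefore use

$$
\begin{aligned}
\langle \xi_k(t)\,\xi_\ell(t') \rangle &= \delta_{tt'} \sum_j y_j^* \left[ a_{kj} - \mu_k^* \right] \left[ a_{\ell j} - \mu_\ell^* \right], \\
\langle \eta_k(t)\,\eta_\ell(t') \rangle &= \delta_{tt'} \sum_i x_i^* \left[ a_{ki} - \nu_k^* \right] \left[ a_{\ell i} - \nu_\ell^* \right],
\end{aligned}
\tag{A22}
$$

where $\mu_i^* = \sum_j a_{ij}\,y_j^*$ and $\nu_j^* = \sum_i a_{ji}\,x_i^*$.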
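The construction of the noise covariance matrix $D^*$ from Eqs. (A20), (A21) and (A22) can be sketched numerically as follows. This is only an illustration under stated assumptions: the payoff matrix and the values used for $\mathbf{x}^*$ and $\mathbf{y}^*$ are placeholders rather than an actual fixed point of the game studied here, and the function names are hypothetical. The sketch uses the fact that the four-term expressions in (A20) and (A21) can be written compactly as $\beta^2\, G\, C\, G^{T}$ with $G_{im} = s_i(\delta_{im} - s_m)$, where $s$ stands for $\mathbf{x}^*$ or $\mathbf{y}^*$ and $C$ is the corresponding covariance from (A22).

```python
import numpy as np

def noise_cov(p, a):
    """C_kl = sum_m p_m (a_km - mean_k)(a_lm - mean_l), mean = a @ p, cf. (A22)."""
    dev = a - (a @ p)[:, None]
    return (dev * p) @ dev.T

def diffusion_block(s, cov, beta):
    """Four-term expression of (A20)/(A21), written as beta^2 * G cov G^T."""
    g = np.diag(s) - np.outer(s, s)  # G_im = s_i (delta_im - s_m)
    return beta**2 * g @ cov @ g.T

def diffusion_matrix(x_star, y_star, a, beta):
    """Block-diagonal 6x6 matrix D*; off-diagonal blocks vanish."""
    d = np.zeros((6, 6))
    d[:3, :3] = diffusion_block(x_star, noise_cov(y_star, a), beta)  # xi noise
    d[3:, 3:] = diffusion_block(y_star, noise_cov(x_star, a), beta)  # eta noise
    return d

# Placeholder inputs for illustration only (not values from the paper).
a_demo = np.array([[3.0, 0.0, 3.0],
                   [5.0, 1.0, 1.4],
                   [2.9, 0.6, 2.9]])
x_star = np.array([0.2, 0.5, 0.3])
y_star = np.array([0.2, 0.5, 0.3])
beta_demo = 0.01
D = diffusion_matrix(x_star, y_star, a_demo, beta_demo)
print(D.round(8))
```

Each diagonal block has vanishing row and column sums, which reflects the normalisation of the mixed strategies: the fluctuations $\tilde{x}_i$ (and likewise $\tilde{y}_j$) must sum to zero.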

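Starting from the linear equation (A18) we now move to Fourier space and write $\hat{\zeta}_a(\omega)$ for the Fourier
transform of $\zeta_a(t)$, and similarly for the noise components $\varphi_a$ ($a = 1, \dots, 6$). One then has

$$
M(\omega)\,\hat{\boldsymbol{\zeta}}(\omega) = \hat{\boldsymbol{\varphi}}(\omega),
\tag{A23}
$$

where $M = e^{i\omega}\,\mathbb{I} - J^*$. The notation $\mathbb{I}$ here indicates the $6\times 6$ identity matrix. The power spectra of
the components of $\boldsymbol{\zeta}$ can then be obtained from Eq. (A23), taking into account (A19), i.e. the fact that
$\langle \hat{\varphi}_a(\omega)\,\hat{\varphi}_b(\omega') \rangle = \delta(\omega + \omega')\,D^*_{ab}$. One then has

$$
\left\langle |\hat{\zeta}_a(\omega)|^2 \right\rangle = \sum_{b,c} \left[ M^{-1}(\omega) \right]_{ab} D^*_{bc} \left[ \left( M^{\dagger}(\omega) \right)^{-1} \right]_{ca}.
\tag{A24}
$$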



The right-hand side can be evaluated numerically using the explicit form of the Jacobian $J^*$ and of the noise
covariance matrix $D^*$. These quantities only depend on the fixed point $\mathbf{z}^*$ of the deterministic dynamics, which
again can be obtained by numerical iteration of the map (A5), or as a numerical solution of the corresponding
fixed point relations.
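A numerical evaluation of the right-hand side of Eq. (A24) might look as follows. This is a sketch only: the function name `power_spectra` is hypothetical, and the $2\times 2$ matrix used in the example is a toy Jacobian (a damped rotation, i.e. a stable spiral) standing in for the $6\times 6$ matrix $J^*$ of the learning dynamics; the noise matrix is likewise a placeholder.

```python
import numpy as np

def power_spectra(jac, d, omegas):
    """Diagonal of M^{-1} D (M^dagger)^{-1}, with M = exp(i w) I - J, for each w."""
    n = jac.shape[0]
    spectra = np.empty((len(omegas), n))
    for m, w in enumerate(omegas):
        m_inv = np.linalg.inv(np.exp(1j * w) * np.eye(n) - jac)
        # (M^dagger)^{-1} equals the conjugate transpose of M^{-1}.
        spectra[m] = np.real(np.diag(m_inv @ d @ m_inv.conj().T))
    return spectra

# Toy stand-in for J*: rotation by angle theta, contracted by a factor r < 1.
r, theta = 0.95, 0.5
jac_toy = r * np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
d_toy = np.eye(2)  # placeholder noise covariance
omegas = np.linspace(0.0, np.pi, 1000)
spec = power_spectra(jac_toy, d_toy, omegas)
peak_frequency = omegas[np.argmax(spec[:, 0])]
print(peak_frequency)
```

For this toy example the spectrum peaks close to the rotation angle `theta` of the complex eigenvalues, which is the signature of a noise-sustained quasi-cycle about a stable spiral fixed point.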

Power spectra of this type are plotted in Fig. 3. It is important to note that these represent power spectra
of the variables $\boldsymbol{\zeta}$, i.e. the pre-factor $N^{-1/2}$ in Eq. (A13) has already been scaled out. The theory hence
predicts that these re-scaled spectra are independent of the batch size $N$, which is why the different spectra
in Fig. 3 collapse on one curve (with the exception of the $N = 1$ case; at such small batch sizes the theory does
not apply). To put it in other words: in order to obtain the raw spectra of deviations from the deterministic
fixed point, the amplitude of each of the spectra in Fig. 3 needs to be divided by $N$. It is then clear
that in absolute terms fluctuations are larger for small batch sizes (e.g. $N = 1$) than for larger batches (e.g.
$N = 1000$). The amplitude of fluctuations and of the cycles scales as $N^{-1/2}$.
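The re-scaling described above can be illustrated with a scalar toy process. In the sketch below, the linear recursion, its parameters and the fixed point value `z_star` are assumptions made for illustration only; the point is simply that multiplying the deviations from the mean by $\sqrt{N}$ before computing the periodogram removes the dependence on the batch size, while the raw variance scales as $1/N$.

```python
import numpy as np

def simulate_fluctuations(j, d, steps, rng):
    """Scalar toy version of zeta(t+1) = J zeta(t) + phi(t), with <phi^2> = d."""
    zeta = np.zeros(steps)
    noise = rng.normal(0.0, np.sqrt(d), size=steps)
    for t in range(steps - 1):
        zeta[t + 1] = j * zeta[t] + noise[t]
    return zeta

def rescaled_periodogram(series, n_batch):
    """Periodogram of sqrt(N) * (deviation from the time average)."""
    fluct = np.sqrt(n_batch) * (series - series.mean())
    return np.abs(np.fft.rfft(fluct)) ** 2 / len(series)

z_star, steps = 0.3, 4096  # placeholder fixed point value and run length
zeta = simulate_fluctuations(0.8, 1.0, steps, np.random.default_rng(0))

# Raw time series z(t) = z* + zeta(t) / sqrt(N) for two batch sizes.
raw = {n: z_star + zeta / np.sqrt(n) for n in (10, 1000)}
spectra = {n: rescaled_periodogram(s, n) for n, s in raw.items()}
variance_ratio = np.var(raw[10]) / np.var(raw[1000])
print(variance_ratio)
```

Here the same realisation of the fluctuations is used for both batch sizes, so the two re-scaled periodograms coincide up to rounding; in a real simulation the collapse would only hold on average.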

We stress at this point that amplification mechanisms have been studied extensively in population-based 
models, chemical reaction systems, evolutionary game theory and epidemiology, see for example ^3 IT 1351110] . 
While the mechanism of amplification is similar to the one discussed here, the source of the noise is different.
Stochasticity in these population-based models arises when populations are finite, hence the term 'demographic
stochasticity' [41]. The approach taken in the population models is based on a systematic expansion in the
inverse square root of the system size. These techniques are originally due to van Kampen [42]. In the
learning system randomness instead comes from imperfect sampling of the opponent's mixed strategy, and the
expansion parameter is the inverse square root of the number of observations made between strategy updates.
Similar batch-size expansions have previously been applied to simpler games in [34].

[Figure 5: four panels, for $\beta = 1$, $\beta = 1.5$, $\beta = 2$ and $\beta = 6$; the ALLC corner of each strategy simplex is labelled.]

FIG. 5: (Color on-line) Comparison of discrete and continuous-time deterministic learning. We show trajectories of
the dynamics at fixed $\lambda/\beta = 0.1$, started from homogeneous initial conditions, $\mathbf{x}(t=0) = \mathbf{y}(t=0)$. The black line in
each simplex is obtained from $\beta = 0.01$, and represents the continuous-time limit. The symbols are for $\beta = 1$ (upper
left panel), $\beta = 1.5$ (upper right), $\beta = 2$ (lower left) and $\beta = 6$ (lower right). In each panel we show the trajectory in
the strategy simplex, as well as the corresponding time series of the propensity $x_{\mathrm{ALLC}}(t)$ of playing ALLC.

3. Comparison of continuous-time and discrete-time deterministic dynamics

The modified replicator equations suggested by Sato and Crutchfield [55] are differential equations, and as
such describe a continuous-time learning process. This approximation is valid for $\beta \ll 1$. The behaviour
of discrete-time deterministic learning can however be quite different from this continuous-time limit, as
illustrated in Fig. 5. We here fix the ratio $\lambda/\beta$ and consider the behaviour at different values of $\beta$. For small $\beta$
the discrete-time map behaves essentially like the continuous dynamics, and has a stable spiral fixed point. As
$\beta$ is increased, however, this fixed point becomes unstable, and a cyclic attractor develops (see footnote). Further increasing
$\beta$ enlarges the cycle, until the attractor finally becomes a rather large triangular-shaped object, as depicted
in panel d). It is here important to note that even though the attractor set looks smooth, the dynamics does
not revolve around the attractor in a continuous motion. We expect that more complicated behaviour, such as
chaotic attractors, will in principle be possible, even though we have not observed them for the present game
and the present learning dynamics. Other learning rules in similar games have however been shown to admit
chaotic motion, see [5^ .

Footnote: While the attractor of the dynamics appears to be a closed cyclic object, it is hard to determine numerically whether the
trajectory is actually periodic, as we cannot exclude small drifts. The attractors plotted in the figure may therefore be invariant
curves of the map, rather than actual cycles.

FIG. 6: (Color on-line) Effects of initial conditions. Left: Attractor obtained from running the deterministic map (A5)
starting from homogeneous initial conditions. The lower panel shows the ALLC-component of the mixed strategy of
each player; the trajectories of both players are identical, $\mathbf{x}(t) = \mathbf{y}(t)$. Right: Attractor obtained from inhomogeneous
initial conditions, $\mathbf{x}(t=0) \neq \mathbf{y}(t=0)$. The lower panel shows that the ALLC components of both players are not identical,
but that there is a relative shift in time, $\mathbf{x}(t) = \mathbf{y}(t - \Delta t)$. Parameters are $\beta = 0.01$, $\lambda = 0.00275$ in both panels.
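The statement that the discrete-time map approaches a continuous-time limit for small $\beta$ at fixed $\lambda/\beta$ can be checked numerically. The sketch below rests on assumptions: the payoff matrix and strategy vectors are placeholders, and the differential equation used for comparison is the one obtained by expanding the deterministic map to first order in $\beta$, $dx_i/d\tau = x_i[(a\mathbf{y})_i - \mathbf{x}\cdot a\mathbf{y}] - (\lambda/\beta)\,x_i[\ln x_i - \sum_k x_k \ln x_k]$ with $\tau = \beta t$, which is taken here as the form of the modified replicator equation.

```python
import numpy as np

# Placeholder payoff matrix for illustration only (NOT the paper's values).
A_DEMO = np.array([[3.0, 0.0, 3.0],
                   [5.0, 1.0, 1.4],
                   [2.9, 0.6, 2.9]])

def map_step(x, y, a, beta, lam):
    """Deterministic map: x_i(t+1) ~ x_i^(1-lam) * exp(beta * sum_j a_ij y_j)."""
    log_w = (1.0 - lam) * np.log(x) + beta * (a @ y)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def continuous_rhs(x, y, a, ratio):
    """First-order expansion of the map in beta, with ratio = lam / beta."""
    pay = a @ y
    return x * (pay - x @ pay) - ratio * x * (np.log(x) - x @ np.log(x))

def limit_error(beta, ratio, x, y, a):
    """Max deviation between (x(t+1) - x(t)) / beta and the continuous-time RHS."""
    step = (map_step(x, y, a, beta, ratio * beta) - x) / beta
    return np.max(np.abs(step - continuous_rhs(x, y, a, ratio)))

x0 = np.array([0.2, 0.3, 0.5])
y0 = np.array([0.4, 0.4, 0.2])
for beta in (1e-2, 1e-3, 1e-4):
    print(beta, limit_error(beta, 0.1, x0, y0, A_DEMO))
```

The printed error decreases roughly in proportion to $\beta$, consistent with the continuous-time description being a small-$\beta$ approximation that cannot be expected to hold at the larger values of $\beta$ considered in Fig. 5.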



4. Inhomogeneous initial conditions

The map defined by Eqs. (A5) describes the coupled dynamics of the two players. It is the analogue of
a two-population replicator equation in evolutionary dynamics. If started from homogeneous initial conditions,
$\mathbf{x}(t=0) = \mathbf{y}(t=0)$, the deterministic map will operate in the subspace in which $\mathbf{x}(t) = \mathbf{y}(t)$, i.e. both
players will play identical mixed strategies at all times. This is not generally the case for the stochastic dynamics, as the
randomness in the players' decisions will break the symmetry. We find that starting the deterministic map
from inhomogeneous initial conditions ($\mathbf{x}(t=0) \neq \mathbf{y}(t=0)$) may affect the resulting attractors, see Fig. 6.
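The invariance of the symmetric subspace under the deterministic map can be verified directly. In the following sketch the payoff matrix and the initial conditions are placeholders chosen for illustration; the only structural assumption is that both players update simultaneously with the same payoff matrix, with the roles of the two strategy vectors exchanged.

```python
import numpy as np

# Placeholder payoff matrix for illustration only (NOT the paper's values).
A_DEMO = np.array([[3.0, 0.0, 3.0],
                   [5.0, 1.0, 1.4],
                   [2.9, 0.6, 2.9]])

def map_step(own, other, a, beta, lam):
    """Deterministic update of one player's mixed strategy."""
    log_w = (1.0 - lam) * np.log(own) + beta * (a @ other)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def iterate(x0, y0, a, beta, lam, steps):
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    for _ in range(steps):
        # Both players update simultaneously from the previous profiles.
        x, y = map_step(x, y, a, beta, lam), map_step(y, x, a, beta, lam)
    return x, y

# Homogeneous start: the two players remain identical at all times.
x_hom, y_hom = iterate([0.2, 0.3, 0.5], [0.2, 0.3, 0.5], A_DEMO, 0.01, 0.00275, 1000)
# Inhomogeneous start: the strategies differ after the first step.
x_inh, y_inh = iterate([0.2, 0.3, 0.5], [0.5, 0.3, 0.2], A_DEMO, 0.01, 0.00275, 1)
print(np.array_equal(x_hom, y_hom), np.allclose(x_inh, y_inh))
```

Any attractor reached from a homogeneous start is therefore confined to the subspace $\mathbf{x} = \mathbf{y}$, whereas an inhomogeneous start can explore the full six-dimensional state space.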



5. Effect of selection intensity on stochastic dynamics

The role of the intensity of selection, $\beta$, in the stochastic dynamics is illustrated in Fig. 7. The ratio
$\lambda/\beta$ is the same as in Fig. 4, but with an increased value of $\beta$. Comparing Fig. 7 with the upper panels
of Fig. 4 shows that an increase of the selection intensity drives the deterministic dynamics to a cycle, and
the stochastic dynamics towards the edges of the strategy simplex, especially at small batch sizes $N$ (strong
noise). A behaviour not too dissimilar from that of the corresponding evolutionary system emerges, cf. Figs.
2a and 2c of [30].

FIG. 7: (Color on-line) Effect of intensity of selection in stochastic dynamics. The figure shows the attractor of deterministic
learning (black curves, started from inhomogeneous initial conditions) along with distributions obtained from stochastic
learning at $N = 1$ (left) and $N = 1000$ (right) for $\beta = 1$ and $\lambda = 0.04$. The ratio $\lambda/\beta$ is thus as in the panels in the
upper row of Fig. 4, but with the intensity of selection increased tenfold.



6. Dominance of TFT in stochastic learning at low memory-loss

In [30] it was reported that evolutionary dynamics in the limit of small mutation rates chooses defection at
infinite population sizes, but that a finite population of a suitable size can instead choose reciprocity (TFT).
An analogous effect is seen in adaptive learning, as illustrated in Fig. 8. We here choose a relatively small
memory-loss rate $\lambda$ and, as a function of the batch size $N$, we measure the frequencies with which each of
the three pure strategies is played asymptotically in the learning dynamics. While ALLD dominates in the
deterministic limit of large $N$, TFT (i.e. reciprocity) is the most frequently used pure strategy in the strongly
stochastic case of small batches.



7. Asynchronous updating

The batch dynamics assumes that both players, Alice and Bob, update their attractions and mixed strategies
synchronously once every $N$ rounds of the game. This assumption was made mainly to simplify analytical
approaches. In this section we briefly show that asynchronous updating does not alter the picture of coherent
stochastic cycles. In Fig. 9 we show time series and power spectra resulting from a learning dynamics in
which each player independently updates with probability $1/N$ after each individual round of the game. That is,
one round of the iterated prisoner's dilemma is played, and then for each player it is determined whether
an update of the player's attractions and mixed strategy occurs (this happens with probability $1/N$), or
whether no update is performed (this happens with probability $1 - 1/N$). In each update all rounds of the
game since the last update are taken into account. With this dynamics each player performs updates on
average every $N$ rounds, but not synchronised with the other player. As seen in the figure, coherent cycles
are found at finite batch sizes as before.
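The asynchronous protocol described above can be sketched as a simulation loop. Several ingredients of this sketch are assumptions rather than details given in the text: the payoff matrix is a placeholder, the update is applied directly to the mixed strategies in the multiplicative form of the deterministic map, and "taking into account all rounds since the last update" is implemented as an average over the payoffs accumulated since that player's previous update.

```python
import numpy as np

# Placeholder payoff matrix for illustration only (NOT the paper's values).
A_DEMO = np.array([[3.0, 0.0, 3.0],
                   [5.0, 1.0, 1.4],
                   [2.9, 0.6, 2.9]])

def update(s, mean_payoff, beta, lam):
    """s_i -> s_i^(1-lam) * exp(beta * mean payoff of strategy i), normalised."""
    log_w = (1.0 - lam) * np.log(s) + beta * mean_payoff
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

def async_learning(a, beta, lam, n_batch, rounds, rng):
    x, y = np.full(3, 1.0 / 3.0), np.full(3, 1.0 / 3.0)
    pay = [np.zeros(3), np.zeros(3)]  # payoffs accumulated since last update
    cnt = [0, 0]                      # rounds played since last update
    gaps = []                         # waiting times between updates of X
    traj = np.empty((rounds, 3))
    for t in range(rounds):
        i = rng.choice(3, p=x)        # action of player X in this round
        j = rng.choice(3, p=y)        # action of player Y in this round
        pay[0] += a[:, j]
        pay[1] += a[:, i]
        cnt[0] += 1
        cnt[1] += 1
        if rng.random() < 1.0 / n_batch:  # X updates with probability 1/N
            x = update(x, pay[0] / cnt[0], beta, lam)
            gaps.append(cnt[0])
            pay[0][:] = 0.0
            cnt[0] = 0
        if rng.random() < 1.0 / n_batch:  # Y updates independently
            y = update(y, pay[1] / cnt[1], beta, lam)
            pay[1][:] = 0.0
            cnt[1] = 0
        traj[t] = x
    return traj, np.array(gaps)

traj, gaps = async_learning(A_DEMO, 0.01, 0.001, 10, 20000, np.random.default_rng(1))
print(gaps.mean())
```

The mean waiting time between updates of a given player is close to $N$ rounds, while the update times of the two players are uncorrelated, as in the protocol described in the text.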



8. Other learning rules

In this section we briefly consider other learning rules, in particular the experience-weighted attraction
(EWA) learning proposed in [551 US]. The EWA update follows an algorithm not too dissimilar from the one
discussed in the main body of this paper. In particular decisions are based on a logit rule, i.e. we have as
before

$$
x_i(t) = \frac{e^{\beta A_i(t)}}{\sum_k e^{\beta A_k(t)}}, \qquad
y_j(t) = \frac{e^{\beta B_j(t)}}{\sum_k e^{\beta B_k(t)}}.
\tag{A25}
$$

FIG. 8: (Color on-line) Temporal average of the players' mixed strategies in stochastic learning as a function of the
batch size. Parameters are fixed at $\beta = 0.01$, $\lambda$ = 10"*. At small batch sizes (strong noise) the players' strategies are
dominated by TFT; at larger values of $N$ defection is played most frequently. Data are from simulations run for 100,000
time steps, with measurements performed in the second half of this interval and data averaged over 400 samples.



EWA learning uses the following update rule for the attractions $A_k(t)$ and $B_k(t)$:

$$
\begin{aligned}
A_k(t) &= \frac{(1-\lambda)\,Z(t-1)\,A_k(t-1) + \left[\delta + (1-\delta)\,I(i(t),k)\right] a_{k,j(t)}}{Z(t)}, \\
B_k(t) &= \frac{(1-\lambda)\,Z(t-1)\,B_k(t-1) + \left[\delta + (1-\delta)\,I(j(t),k)\right] a_{k,i(t)}}{Z(t)}.
\end{aligned}
\tag{A26}
$$

Here $i(t)$ is the action taken by player $X$ in round $t$, and $j(t)$ the action of player $Y$ in round $t$. $I(\cdot,\cdot)$
indicates the Kronecker function, i.e. $I(i,j) = 1$ for $i = j$, and $I(i,j) = 0$ otherwise. The normalisation in the
denominator is updated as

$$
Z(t) = (1-\lambda)(1-\kappa)\,Z(t-1) + 1.
\tag{A27}
$$



We note that $\phi$ in the notation of |2Zl is equal to $1 - \lambda$ in our notation. Eqs. (A26) correspond to
on-line learning of batch size $N = 1$. One generalisation to batches of size $N$ is given by

$$
\begin{aligned}
A_k(t) &= \frac{(1-\lambda)\,Z(t-1)\,A_k(t-1) + N^{-1} \sum_{\alpha=1}^{N} \left[\delta + (1-\delta)\,I(i_\alpha(t),k)\right] a_{k,j_\alpha(t)}}{Z(t)}, \\
B_k(t) &= \frac{(1-\lambda)\,Z(t-1)\,B_k(t-1) + N^{-1} \sum_{\alpha=1}^{N} \left[\delta + (1-\delta)\,I(j_\alpha(t),k)\right] a_{k,i_\alpha(t)}}{Z(t)}.
\end{aligned}
\tag{A28}
$$

The parameters $\beta$, $\phi = 1 - \lambda$, $\kappa$ and $\delta$ have been fitted to results from real-world experiments on human subjects
in [53]. We here focus on the choices $\delta = 1$, $\kappa = 0.75$ and $\phi = 0.8$; these are roughly consistent with
reported values, see in particular Table 4 of the corresponding reference. It is here important to note that the specific values
of these parameter estimates may depend on the detailed experimental protocol, and more importantly on the
game under consideration. The purpose of the present section is to show that amplified stochastic oscillations
may occur in principle in EWA learning; a more detailed analysis of the dependence on model parameters is left
for future work.

FIG. 9: (Color on-line) Stochastic cycles in asynchronous updating. We show the power spectrum of fluctuations of
the propensity to play ALLC (main panel, averaged over 1000 independent runs) as well as a time series from one
individual run (inset). Parameters are $\beta = 0.01$, $\lambda = 0.001$; the batch size is $N = 10$.
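The EWA attraction update of Eqs. (A26)-(A28) can be sketched for one player as follows. The function name, the payoff matrix and the example inputs are hypothetical and chosen for illustration only; the parameter values $\phi = 1 - \lambda = 0.8$ and $\kappa = 0.75$ are the ones quoted in the text. With a single round per batch the function corresponds to the on-line rule (A26).

```python
import numpy as np

def ewa_batch_update(attr, z_prev, own_actions, opp_actions, a, lam, kappa, delta):
    """Batch EWA update for player X; returns the new attractions and new Z."""
    z_new = (1.0 - lam) * (1.0 - kappa) * z_prev + 1.0  # normalisation, (A27)
    n_strat = len(attr)
    reinforcement = np.zeros(n_strat)
    for i, j in zip(own_actions, opp_actions):
        # Weight 1 for the action actually played, delta for foregone payoffs.
        weight = delta + (1.0 - delta) * (np.arange(n_strat) == i)
        reinforcement += weight * a[:, j]
    reinforcement /= len(own_actions)                   # batch average, 1/N
    attr_new = ((1.0 - lam) * z_prev * attr + reinforcement) / z_new
    return attr_new, z_new

# Placeholder payoff matrix for illustration only (NOT the paper's values).
a_demo = np.array([[3.0, 0.0, 3.0],
                   [5.0, 1.0, 1.4],
                   [2.9, 0.6, 2.9]])
lam, kappa = 0.2, 0.75  # phi = 1 - lam = 0.8

# Single round (N = 1), delta = 0: only the played strategy is reinforced.
attr_1, z_1 = ewa_batch_update(np.zeros(3), 1.0, [2], [0], a_demo, lam, kappa, 0.0)

# Iterating (A27) alone: Z approaches 1 / (1 - (1 - lam)(1 - kappa)).
z = 1.0
for _ in range(50):
    z = (1.0 - lam) * (1.0 - kappa) * z + 1.0
print(attr_1, z_1, z)
```

For $\delta = 1$ the weight is one for every pure strategy, so all attractions are reinforced with their (foregone) payoffs, and the update differs from the simplified model of the main text only through the normalisation $Z(t)$.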



9. The case $\delta = 1$

We first investigate the case $\delta = 1$, in which all foregone payoffs are re-inforced, i.e. the attractions of all pure
strategies are updated, even those of strategies that have not actually been played. Results are shown in Fig.
10, and as seen in the figure the behaviour of the EWA learning model for these parameters is very similar to
that of the simplified model of the main body of the paper. The power spectra in the right panel confirm the
existence of amplified stochastic oscillations; individual trajectories at different batch sizes are shown in the
left-hand panels. Note that the spectra shown in Fig. 10 are obtained from fluctuations about the mean of the time series,
and that these fluctuations have been re-scaled to take into account the $1/\sqrt{N}$ nature of their magnitude.



10. The case $\delta < 1$

The case $\delta < 1$ is discussed briefly in Fig. 11. Here the strategic choices actually taken are re-inforced with a
stronger weight than those which were not played. We find that oscillations persist, provided $\delta$ is not too small:
the power spectra of fluctuations maintain their maxima at non-zero characteristic frequencies (see Fig. 11).
At values of $\delta$ smaller than some threshold value (which appears to depend on the other model parameters), no
oscillations are found. Near $\delta = 0$ the dynamics may even converge to pure actions. A further, more detailed
analysis of the EWA model is possible based on the techniques developed in [34] and the present paper. In
particular the analytical methods of Sec. A 2 will allow for a detailed study of the regions of parameter space
in which cycles between co-operation and reciprocity are to be expected. This will be the topic of future
work.

FIG. 10: (Color on-line) Left: Trajectories from single runs of the EWA learning process at $\beta = 1$, $\kappa = 0.75$, $\phi = 0.8$, $\delta = 1$.
The curves show the propensity, $x_{\mathrm{ALLC}}$, to use ALLC. Right: Corresponding re-scaled power spectra. Simulations
are here averaged over at least 100 samples.

FIG. 11: (Color on-line) The case $\delta < 1$. Other parameters as in Fig. 10. The batch size is $N = 1$. The main panel
shows power spectra obtained from time series $x_{\mathrm{ALLC}}(t)$ (averaged over 1000 samples), the insets show trajectories
$x_{\mathrm{ALLC}}(t)$ from individual runs at $\delta = 0.9, 0.8, 0.7$ (from top to bottom).



